This study analyzes Spotify Top Songs from 1956 to 2019. By examining the characteristics of highly popular songs across different eras, we aim to identify common traits of hit songs, which can help predict whether a newly released song has the potential to become a hit.
Every year, countless songs are released, and some become viral hits
with millions of streams. Perhaps your song could be this year’s
breakout hit.
Listening to past hit songs, we noticed that popular songs often
share similar characteristics in style, lyrics, and rhythm: songs
containing certain elements tend to reach a certain level of popularity.
This led us to consider that analyzing high-traffic songs over the years
could help identify the common elements of hit songs and predict the
likelihood of a new song becoming popular.
Additionally, this analysis can help record labels allocate marketing
budgets efficiently, directing resources to songs with the highest
potential.
We obtained the Spotify All Time Top Songs Mega Dataset from Kaggle.
This study uses fourteen features for analysis (including Title, Artist,
Top Genre, Year, BPM, Energy, Danceability, Loudness (dB), Liveness,
Valence, Duration, Acousticness, Speechiness, Popularity).
Feature explanations:
> Title: Song title
> Artist: Artist name
> Top Genre: Song genre
> Year: Release year of the song
> Beats per Minute (BPM): Song tempo
> Energy: Song energy level; higher values indicate a more energetic track
> Danceability: Higher values indicate the song is easier to dance to
> Loudness: Song loudness in decibels; higher values mean a louder track under the same volume baseline
> Liveness: Higher values indicate a greater likelihood that the track was recorded live
> Valence: Positivity of the song; higher values indicate a more positive mood
> Acousticness: Higher values indicate more acoustic elements in the song
> Speechiness: Higher values indicate more spoken-word content
> Length: Song duration
> Popularity: Popularity score; higher values indicate greater popularity, ranging from 0 to 100
To ensure clarity in analysis, data is segmented by decades from the
1950s to 2010s. Basic visualizations such as bar charts and scatter
plots are used to analyze Top Genre, BPM, musical attributes
(highlighted in red), and song duration.
We explore yearly trends in BPM and Loudness to examine if song tempo or
loudness has increased over the years, reflecting changing public
preferences.
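As a minimal sketch of this trend computation (using a toy stand-in for the real `spotify` data frame, whose relevant columns are `Year`, `Bpm`, and `Loudness`), base R's `aggregate()` yields the yearly means to plot:

```r
# Toy stand-in for the spotify data frame (the real set has ~2,000 rows).
spotify_toy <- data.frame(
  Year     = c(2000, 2000, 2001, 2001),
  Bpm      = c(100, 120, 130, 110),
  Loudness = c(-8, -6, -5, -7)
)
# Mean BPM and Loudness per year; plotting these against Year shows the trend.
yearly <- aggregate(cbind(Bpm, Loudness) ~ Year, data = spotify_toy, FUN = mean)
yearly
```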
Beyond the basic analysis, we also study correlations among musical
attributes using a correlation heatmap, and examine the popularity of
different genres based on musical features to understand listener preferences.
Knowing the characteristics of hit songs over the years allows us to
create an interactive web tool. Users can input a song and get a
prediction of its likelihood to become a hit based on its
features.
Possible extensions: 1. Direct user interaction
2. Recommend potentially popular songs by genre
3. Incorporate other elements, e.g., whether the song contains explicit
lyrics, and how that affects popularity
Music is an art form, but in the era of big data, public preferences can also be quantified. We can predict whether new songs have hit potential, helping in strategic planning for music production and marketing.
spotify
namefreq<-spotify %>% count(Artist)
names(namefreq) <- c("word","freq")
# my_graph<-wordcloud2(namefreq)
# saveWidget(my_graph,"tmp.html",selfcontained = F)
# webshot("tmp.html","fig_1.png", delay =5, vwidth = 1200, vheight=800)
knitr::include_graphics("fig_1.png")
From the raw data, we observe the frequency of appearances of each artist. Many familiar names appear! (Note: We pre-generated the wordcloud as it can cause rendering issues in later plots.)
p<-spotify_genre_ad %>%
ggplot(aes(x=Year,group=Genre))+
geom_line(aes(color=Genre),position="identity",stat="count")+
transition_reveal(Year)
animate(p)
## `geom_line()`: Each group consists of only one observation.
## ℹ Do you need to adjust the group aesthetic?
By visualizing the counts by genre, we see that album rock dominated early on, dance music rose around 2000, and the music market became more diverse over time.
year <- spotify$Year
sixties_songdata <- spotify %>% filter(Year>1959 & Year<1970)
seventies_songdata <- spotify %>% filter(Year>1969 & Year<1980)
eighties_songdata <- spotify %>% filter(Year>1979 & Year<1990)
nineties_songdata <- spotify %>% filter(Year>1989 & Year<2000)
thousands_songdata <- spotify %>% filter(Year>1999 & Year<2010)
tens_songdata <- spotify %>% filter(Year>2009 & Year<2020)
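The six decade filters above can equivalently be collapsed into a single decade label; a base-R sketch:

```r
# Map each release year to its decade label in one step.
year <- c(1965, 1972, 1983, 1994, 2005, 2016)
decade <- paste0(floor(year / 10) * 10, "s")
decade
```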
sixties_songdata %>% group_by(Genre) %>%
summarize(Popularity=mean(Popularity))%>%
arrange(desc(Popularity)) %>% subset(Popularity>60)%>%
hchart(type="bar",hcaes(x=Genre, y=Popularity))%>%
hc_xAxis(title = list(text = "Genre"))%>%
hc_yAxis(title = list(text = "Average popularity by genre"))
The genres with average popularity above 60 show that Classic Soul was the most popular in the 1960s, but differences among genres were small.
sixties_songdata%>%
count(Genre)%>%
arrange(n)%>%
hchart(type = "treemap", hcaes(x = Genre, value = n, color = n))
Album rock was the most frequent genre, but its average popularity was not the highest, suggesting no direct correlation between quantity and popularity.
seventies_songdata %>% group_by(Genre) %>%
summarize(Popularity=mean(Popularity))%>%
arrange(desc(Popularity)) %>% subset(Popularity>60)%>%
hchart(type="bar",hcaes(x=Genre, y=Popularity))%>%
hc_xAxis(title = list(text = "Genre"))%>%
hc_yAxis(title = list(text = "Average popularity by genre"))
Classic Soul remains the genre with the highest average popularity in the 1970s.
seventies_songdata%>%
count(Genre)%>%
arrange(n)%>%
hchart(type = "treemap", hcaes(x = Genre, value = n, color = n))
In the 1970s, the proportion of album rock increased, gradually compressing the second-largest genre, adult standard. Comparing with the previous chart, we can see that classic soul has the highest average popularity overall, but there are only four songs in the data. Therefore, we examine the range of popularity for each genre.
seventies_songdata %>% group_by(Genre) %>%
reframe(Popularity=range(Popularity))%>%
arrange(desc(Popularity)) %>%
hchart(type="bar",hcaes(x=Genre, y=Popularity))%>%
hc_xAxis(title = list(text = "Genre"))%>%
hc_yAxis(title = list(text = "Popularity range by genre"))
The largest range is observed in album rock, suggesting that the large variation in popularity results in a lower average popularity for this genre.
eighties_songdata %>% group_by(Genre) %>%
summarize(Popularity=mean(Popularity))%>%
arrange(desc(Popularity)) %>% subset(Popularity>62)%>%
hchart(type="bar",hcaes(x=Genre, y=Popularity))%>%
hc_xAxis(title = list(text = "Genre"))%>%
hc_yAxis(title = list(text = "Average popularity by genre"))
In the 1980s, several new genres achieved high average popularity. Interestingly, many of the top genres can be classified as having “exotic traditional” styles, such as celtic punk from Ireland, along with emerging British and Canadian music styles.
eighties_songdata%>%
count(Genre)%>%
arrange(n)%>%
hchart(type = "treemap", hcaes(x = Genre, value = n, color = n))
The proportion of album rock decreased significantly in the 1980s, replaced by more exotic-style music, suggesting that the U.S. music market became increasingly culturally diverse.
nineties_songdata %>% group_by(Genre) %>%
summarize(Popularity=mean(Popularity))%>%
arrange(desc(Popularity)) %>% subset(Popularity>62)%>%
hchart(type="bar",hcaes(x=Genre, y=Popularity))%>%
hc_xAxis(title = list(text = "Genre"))%>%
hc_yAxis(title = list(text = "Average popularity by genre"))
In the 1990s, pop genres increased in popularity. The rise of pop punk corresponds to the popularity of dance hall culture. Rock and metal bands also became more popular.
nineties_songdata%>%
count(Genre)%>%
arrange(n)%>%
hchart(type = "treemap", hcaes(x = Genre, value = n, color = n))
Although album rock and alternative rock remained the main genres, their proportion gradually decreased, while pop genres gained popularity, consistent with the previous chart.
thousands_songdata %>% group_by(Genre) %>%
summarize(Popularity=mean(Popularity))%>%
arrange(desc(Popularity)) %>% subset(Popularity>70)%>%
hchart(type="bar",hcaes(x=Genre, y=Popularity))%>%
hc_xAxis(title = list(text = "Genre"))%>%
hc_yAxis(title = list(text = "Average popularity by genre"))
Besides the continuing rise of pop music, hip hop also became highly popular, with many subgenres appearing among the top popular categories.
thousands_songdata%>%
count(Genre)%>%
arrange(n)%>%
hchart(type = "treemap", hcaes(x = Genre, value = n, color = n))
Dance pop replaced album rock in terms of proportion, but rock music still held a significant share when grouped together. Hip hop, though lower in proportion, was very popular.
tens_songdata %>% group_by(Genre) %>%
summarize(Popularity=mean(Popularity))%>%
arrange(desc(Popularity)) %>% subset(Popularity>70)%>%
hchart(type="bar",hcaes(x=Genre, y=Popularity))%>%
hc_xAxis(title = list(text = "Genre"))%>%
hc_yAxis(title = list(text = "Average popularity by genre"))
After 2010, exotic-style music achieved high popularity. Listeners’ tastes became more diverse, but pop music remained dominant.
tens_songdata%>%
count(Genre)%>%
arrange(n)%>%
hchart(type = "treemap", hcaes(x = Genre, value = n, color = n))
spotify_characteristic <- spotify %>%
  select(Popularity, Energy, Danceability, Loudness, Liveness, Valence, Acousticness, Speechiness)
GGally::ggcorr(data = spotify_characteristic, method = c("complete", "pearson"))
Popularity shows some positive correlation with Energy and Danceability; the other features show little or no significant correlation. Energy and Loudness are highly correlated with each other, suggesting potential multicollinearity. We next examine songs with Popularity ≥ 70.
spotify_popularity70 <- subset(spotify, Popularity >= 70)
spotify.active70 <- spotify_popularity70 %>%
select(Popularity,Energy,Danceability,Loudness,
Liveness,Valence,Acousticness,Speechiness)
GGally::ggcorr(data = spotify.active70, method = c("complete", "pearson"))
For songs with Popularity ≥ 70, only Loudness and Danceability maintain a relatively high positive correlation, while the other features show low or negligible correlation.
spotify1 <- spotify[,-c(1:3)]   # drop Title, Artist, and Top Genre
spotify_mc <- data.frame(spotify1)
lm_pop_to_char<-lm(Popularity~Year+Bpm+Energy+Danceability+Loudness+Liveness+Valence+Length+Acousticness+Speechiness,data=spotify_mc)
summary(lm_pop_to_char)
##
## Call:
## lm(formula = Popularity ~ Year + Bpm + Energy + Danceability +
## Loudness + Liveness + Valence + Length + Acousticness + Speechiness,
## data = spotify_mc)
##
## Residuals:
## Min 1Q Median 3Q Max
## -47.469 -9.165 1.158 9.723 37.842
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 553.833070 41.435305 13.366 < 2e-16 ***
## Year -0.256877 0.020997 -12.234 < 2e-16 ***
## Bpm 0.002382 0.011090 0.215 0.829934
## Energy -0.100553 0.027529 -3.653 0.000266 ***
## Danceability 0.154640 0.024416 6.334 2.96e-10 ***
## Loudness 1.375388 0.134148 10.253 < 2e-16 ***
## Liveness -0.091328 0.018534 -4.927 9.02e-07 ***
## Valence -0.028175 0.017075 -1.650 0.099083 .
## Length -0.003547 0.004774 -0.743 0.457526
## Acousticness -0.028104 0.014154 -1.986 0.047219 *
## Speechiness 0.337251 0.070584 4.778 1.90e-06 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 13.37 on 1983 degrees of freedom
## Multiple R-squared: 0.1359, Adjusted R-squared: 0.1315
## F-statistic: 31.19 on 10 and 1983 DF, p-value: < 2.2e-16
From the previous correlation plot, potential multicollinearity is observed. We will check individual VIF values:
vif_values <- vif(lm(Popularity~Year+Bpm+Energy+Danceability+Loudness+Liveness+Valence+Length+Acousticness+Speechiness, data=spotify_mc))
vif <- data.frame(variables = names(vif_values), vif_value = as.numeric(vif_values)) %>%
  arrange(desc(vif_value))
knitr::kable(vif) %>%
kable_styling(bootstrap_options = "striped", full_width = F) %>%
row_spec(1, bold = T, color = "white", background = "#D7261E")
| variables | vif_value |
|---|---|
| Energy | 4.144251 |
| Loudness | 2.668141 |
| Valence | 2.007379 |
| Acousticness | 1.878791 |
| Danceability | 1.565367 |
| Year | 1.275841 |
| Length | 1.123852 |
| Bpm | 1.076494 |
| Speechiness | 1.075458 |
| Liveness | 1.070967 |
Energy has a VIF > 4, indicating multicollinearity. We check which variables are associated with Energy:
summary(lm(Energy~Year+Bpm+Danceability+Loudness+Liveness+Valence+Length+Acousticness+Speechiness,data=spotify_mc))
##
## Call:
## lm(formula = Energy ~ Year + Bpm + Danceability + Loudness +
## Liveness + Valence + Length + Acousticness + Speechiness,
## data = spotify_mc)
##
## Residuals:
## Min 1Q Median 3Q Max
## -36.589 -7.230 -0.144 7.243 33.618
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 121.084515 33.682609 3.595 0.000332 ***
## Year -0.057188 0.017076 -3.349 0.000826 ***
## Bpm 0.025237 0.009027 2.796 0.005226 **
## Danceability -0.078886 0.019833 -3.977 7.22e-05 ***
## Loudness 3.272998 0.081053 40.381 < 2e-16 ***
## Liveness 0.104147 0.014934 6.974 4.18e-12 ***
## Valence 0.236554 0.012873 18.376 < 2e-16 ***
## Length 0.023902 0.003856 6.198 6.93e-10 ***
## Acousticness -0.264910 0.009893 -26.777 < 2e-16 ***
## Speechiness 0.387968 0.056902 6.818 1.22e-11 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 10.91 on 1984 degrees of freedom
## Multiple R-squared: 0.7587, Adjusted R-squared: 0.7576
## F-statistic: 693.1 on 9 and 1984 DF, p-value: < 2.2e-16
Almost all variables are related to Energy. Therefore, we remove Energy to examine the effect on multicollinearity of other factors:
vif_values1 <- vif(lm(Popularity~Year+Bpm+Danceability+Loudness+Liveness+Valence+Length+Acousticness+Speechiness, data=spotify_mc))
vif1 <- data.frame(variables = names(vif_values1), vif_value = as.numeric(vif_values1)) %>%
  arrange(desc(vif_value))
knitr::kable(vif1) %>%
kable_styling(bootstrap_options = "striped", full_width = F)
| variables | vif_value |
|---|---|
| Valence | 1.715415 |
| Danceability | 1.552984 |
| Loudness | 1.464489 |
| Acousticness | 1.380043 |
| Year | 1.268668 |
| Length | 1.102504 |
| Bpm | 1.072269 |
| Speechiness | 1.050836 |
| Liveness | 1.045341 |
All VIF values are now below 4, alleviating multicollinearity in the subsequent regression analysis.
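For reference, the VIF reported by `car::vif()` above can be computed by hand: VIF_j = 1 / (1 − R²_j), where R²_j comes from regressing predictor j on the remaining predictors. A toy sketch (the variables here are illustrative, not from the Spotify data):

```r
set.seed(1)
x1 <- rnorm(200)
x2 <- x1 + rnorm(200, sd = 0.5)  # deliberately collinear with x1
x3 <- rnorm(200)                 # independent predictor
# VIF for x2: regress it on the other predictors and invert 1 - R^2.
r2_x2  <- summary(lm(x2 ~ x1 + x3))$r.squared
vif_x2 <- 1 / (1 - r2_x2)        # well above 1, flagging collinearity
# VIF for the independent x3 stays close to 1.
r2_x3  <- summary(lm(x3 ~ x1 + x2))$r.squared
vif_x3 <- 1 / (1 - r2_x3)
```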
PCA without Energy:
spotify_test <- spotify_mc[,-c(3,11)]   # drop Energy (col 3) and Popularity (col 11)
pca_spotify<-prcomp(spotify_test,scale. = TRUE)
pca_spotify
## Standard deviations (1, .., p=9):
## [1] 1.3813552 1.2300727 1.0576690 1.0198534 0.9887333 0.9598953 0.8464142
## [8] 0.6841995 0.5800653
##
## Rotation (n x k) = (9 x 9):
## PC1 PC2 PC3 PC4 PC5
## Year -0.2238406 -0.40828118 0.512756050 -0.32345368 0.13113839
## Bpm -0.1115834 -0.23095177 -0.565134582 0.19366987 0.59958954
## Danceability -0.3990834 0.46759173 0.270236804 0.09894569 -0.13453300
## Loudness -0.4840906 -0.37429374 0.126137593 -0.08400684 0.07613662
## Liveness -0.0905829 -0.17276110 -0.449740455 -0.46047232 -0.56840114
## Valence -0.4628039 0.46949379 -0.190431966 0.11180372 -0.01035366
## Length 0.1113730 -0.33015937 0.002742636 0.63348604 -0.50048917
## Acousticness 0.4882861 0.24802664 0.024352221 -0.37726925 0.08891998
## Speechiness -0.2657316 -0.02089687 -0.299389631 -0.26990890 -0.13347645
## PC6 PC7 PC8 PC9
## Year 0.14526643 -0.4648746 -0.172122773 -0.36280373
## Bpm 0.07351902 -0.4333267 0.003888681 0.15692517
## Danceability 0.17496027 -0.4048524 -0.034899680 0.57047218
## Loudness -0.19917662 0.2885415 0.642886482 0.24610027
## Liveness -0.33256940 -0.3049853 -0.084531346 0.11748738
## Valence -0.18116459 -0.1112328 0.220705729 -0.65008630
## Length 0.24525793 -0.2683873 0.279089579 -0.12928721
## Acousticness 0.18367211 -0.3056474 0.649351531 -0.02035608
## Speechiness 0.81602714 0.2794944 -0.020800028 -0.06480015
summary(pca_spotify)
## Importance of components:
## PC1 PC2 PC3 PC4 PC5 PC6 PC7 PC8
## Standard deviation 1.381 1.2301 1.0577 1.0199 0.9887 0.9599 0.8464 0.68420
## Proportion of Variance 0.212 0.1681 0.1243 0.1156 0.1086 0.1024 0.0796 0.05201
## Cumulative Proportion 0.212 0.3801 0.5044 0.6200 0.7286 0.8310 0.9106 0.96261
## PC9
## Standard deviation 0.58007
## Proportion of Variance 0.03739
## Cumulative Proportion 1.00000
PC1–PC3 together explain over half of the variance. To compare with noise, we simulate noise 100 times:
sim_noise_ev <- function(n, p) {
noise <- data.frame(replicate(p, rnorm(n)))
eigen(cor(noise))$values
}
evalues_noise <- replicate(100, sim_noise_ev(nrow(spotify_test), ncol(spotify_test)))
evalues_mean <- apply(evalues_noise, 1, mean)
screeplot(pca_spotify,type="line")
abline(h=1,col="grey",lty="dotted")
lines(evalues_mean, type="b",col="blue")
PC1, PC2, and PC3 are the most representative components of variance compared to noise.
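The proportion-of-variance row printed by `summary(pca_spotify)` can also be recovered directly from the component standard deviations; a toy sketch (random data, not the Spotify set):

```r
set.seed(7)
toy <- data.frame(matrix(rnorm(200), ncol = 4))
p <- prcomp(toy, scale. = TRUE)
# Each component's share of the total variance, as printed by summary(p).
prop_var <- p$sdev^2 / sum(p$sdev^2)
```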
top4.pca.eigenvector <- pca_spotify$rotation[,1:4]
first.pca <- top4.pca.eigenvector[, 1]
second.pca <- top4.pca.eigenvector[, 2]
third.pca <- top4.pca.eigenvector[, 3]
dotchart(first.pca[order(first.pca, decreasing=FALSE)] ,
main="Loading Plot for PC1",xlab="Variable Loadings", col="red")
PC1 has its largest positive loading on Acousticness; the Length loading is close to zero and is not considered.
first.pca
## Year Bpm Danceability Loudness Liveness Valence
## -0.2238406 -0.1115834 -0.3990834 -0.4840906 -0.0905829 -0.4628039
## Length Acousticness Speechiness
## 0.1113730 0.4882861 -0.2657316
We define PC1 as:
PC1 = -0.2238*Year -0.1116*BPM -0.3991*Danceability -0.4841*Loudness -0.0906*Liveness -0.4628*Valence +0.1114*Length +0.4883*Acousticness -0.2657*Speechiness
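A formula like the PC1 expression above is applied to the *standardized* feature values (since `prcomp` was run with `scale. = TRUE`); a song's score is the dot product of its standardized row with the loading vector. A toy check (hypothetical data, not the Spotify set):

```r
set.seed(42)
toy <- data.frame(a = rnorm(50), b = rnorm(50), c = rnorm(50))
p <- prcomp(toy, scale. = TRUE)
# Applying the PC1 loadings to the standardized data by hand
# reproduces the scores stored in p$x[, 1].
manual_pc1 <- as.numeric(scale(toy) %*% p$rotation[, 1])
```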
dotchart(second.pca[order(second.pca, decreasing=FALSE)] ,
main="Loading Plot for PC2",xlab="Variable Loadings", col="red")
PC2 loads most positively on Valence and Danceability, and most negatively on Year and Loudness.
second.pca
## Year Bpm Danceability Loudness Liveness Valence
## -0.40828118 -0.23095177 0.46759173 -0.37429374 -0.17276110 0.46949379
## Length Acousticness Speechiness
## -0.33015937 0.24802664 -0.02089687
We define PC2 as:
PC2 = -0.4083*Year -0.2310*BPM +0.4676*Danceability -0.3743*Loudness -0.1728*Liveness +0.4695*Valence -0.3302*Length +0.2480*Acousticness -0.0209*Speechiness
dotchart(third.pca[order(third.pca, decreasing=FALSE)] ,
main="Loading Plot for PC3",xlab="Variable Loadings", col="red")
PC3 loads most positively on Year and most negatively on BPM; the Length and Acousticness loadings are near zero.
third.pca
## Year Bpm Danceability Loudness Liveness Valence
## 0.512756050 -0.565134582 0.270236804 0.126137593 -0.449740455 -0.190431966
## Length Acousticness Speechiness
## 0.002742636 0.024352221 -0.299389631
PC3 defined as:
PC3 = 0.5128*Year -0.5651*BPM +0.2702*Danceability +0.1261*Loudness -0.4497*Liveness -0.1904*Valence +0.0027*Length +0.0244*Acousticness -0.2994*Speechiness
Regression Using PCA Scores
pre <- predict(pca_spotify)
spotify_test$pc1 <- pre[,1]
spotify_test$pc2 <- pre[,2]
spotify_test$pc3 <- pre[,3]
spotify_test$Popularity <- spotify_mc$Popularity
lm.sol<-lm(Popularity~pc1+pc2+pc3, data=spotify_test)
summary(lm.sol)
##
## Call:
## lm(formula = Popularity ~ pc1 + pc2 + pc3, data = spotify_test)
##
## Residuals:
## Min 1Q Median 3Q Max
## -47.565 -9.202 2.189 10.700 38.435
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 59.5266 0.3161 188.329 < 2e-16 ***
## pc1 -1.5989 0.2289 -6.986 3.85e-12 ***
## pc2 1.1723 0.2570 4.561 5.40e-06 ***
## pc3 -0.3002 0.2989 -1.004 0.315
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 14.11 on 1990 degrees of freedom
## Multiple R-squared: 0.03427, Adjusted R-squared: 0.03281
## F-statistic: 23.54 on 3 and 1990 DF, p-value: 5.719e-15
PC1 and PC2 are significant predictors of Popularity (p < 0.001), while PC3 is not significant (p = 0.315).
coef<-coef(lm.sol)
coef
## (Intercept) pc1 pc2 pc3
## 59.5265797 -1.5988774 1.1722783 -0.3002102
Linear regression model: Popularity = 59.5266 - 1.5989*PC1 + 1.1723*PC2 - 0.3002*PC3
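Applying the fitted coefficients by hand for a hypothetical song (the PC scores -1, 0.5, and 0 below are illustrative values only):

```r
# Coefficients from the fitted model lm.sol above.
b <- c(intercept = 59.5266, pc1 = -1.5989, pc2 = 1.1723, pc3 = -0.3002)
scores <- c(pc1 = -1, pc2 = 0.5, pc3 = 0)
predicted_popularity <- unname(b["intercept"] + sum(b[c("pc1", "pc2", "pc3")] * scores))
```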
k_fold_mse <- function( k, model,outcome,dataset) {
shuffled_indicies <- sample(1:nrow(dataset))
dataset <- dataset[shuffled_indicies,]
fold_pred_errors <- sapply(1:k, \(i) {
fold_i_pe(i, k, dataset, model,outcome)
})
pred_errors <- unlist(fold_pred_errors)
mse<-\(errs) mean(errs^2)
c(is=mse(residuals(model)),oos=mse(pred_errors))
}
fold_i_pe <- function(i, k, dataset, model,outcome) {
folds <- cut(1:nrow(dataset),k, labels = FALSE)
test_indices <- which(folds==i)
test_set <- dataset[test_indices,]
train_set <- dataset[-test_indices,]
trained_model <- update(model,data=train_set)
predictions <- predict(trained_model, test_set)
dataset[test_indices,outcome] - predictions
}
pc_mse<-k_fold_mse(10,lm.sol,"Popularity",spotify_test)
pc_mse
## is oos
## 198.8105 199.4796
In-sample MSE: 198.8105; out-of-sample MSE: 199.4796 (the out-of-sample value varies slightly with the random fold shuffle).
mse_oos <- function(actual, predicted) mean((actual - predicted)^2)
spotify_tree <- rpart(Popularity~Year+Bpm+Danceability+Loudness+Liveness+Valence+Length+Acousticness+Speechiness,data=spotify_test)
rpart.plot(spotify_tree)
mse_oos_decision <- mse_oos(spotify_test$Popularity, predict(spotify_tree, spotify_test))
mse_oos_decision
## [1] 180.535
set.seed(1)
randomforest_spotify <- randomForest(Popularity~Year+Bpm+Danceability+Loudness+Liveness+Valence+Length+Acousticness+Speechiness,data=spotify_test,importance=TRUE,proximity=TRUE,do.trace=100)
##      | Out-of-bag   |
## Tree | MSE  %Var(y) |
##  100 | 177.8  86.35 |
##  200 | 175.5  85.24 |
##  300 | 175.3  85.17 |
##  400 | 175    85.01 |
##  500 | 174.2  84.61 |
plot(randomforest_spotify)
The plot shows the out-of-bag error decreasing as trees are added, bringing the model's predictions closer to the actual values.
varImpPlot(randomforest_spotify)
The variable importance plot shows that Year, Length, and Danceability are the most important predictors.
We compare the MSE of PCA, Decision Tree, and Random Forest models:
mse_compare <- data.frame(model=c("PCA test","Decision tree","Random forest"), mse_oos = c(199.4796, 180.535, 174.2))
mse_compare %>% arrange(-mse_oos)%>%
ggplot(aes(x=model,y=mse_oos,fill=model))+
geom_bar(stat="identity", width=0.5)+
scale_fill_brewer(palette="Greens")+
geom_text(aes(label = mse_oos), vjust = 2)
Random Forest achieves the lowest MSE, so we choose it for predictions.
We test the model with songs from 2019 using Random Forest to predict popularity:
new_song <- read.csv("New song.csv",header=T)
## Warning in read.table(file = file, header = header, sep = sep, quote = quote, :
## incomplete final line found by readTableHeader on 'New song.csv'
names(new_song)<-c("Title","Artist","Year","Genre","Bpm", "Energy", "Danceability","Loudness","Liveness","Valence","Length","Acousticness","Speechiness","Popularity")
new_song$Length = as.integer(as.factor(new_song$Length))
new_song
Predict popularity using Random Forest:
predicted_data <- predict(randomforest_spotify, newdata=new_song)
predicted_data
## 1
## 65.96297
Compare predicted vs actual values:
Popularity <- c("predicted","real")
Value <- c(round(predicted_data,3),new_song$Popularity)
compare <- data.frame(Popularity,Value)
compare %>% ggplot(aes(x=Popularity,y=Value,fill=Popularity))+
geom_bar(stat="identity", position=position_dodge(),width=0.5)+
scale_fill_brewer(palette="Greens")+
geom_text(aes(label = Value), vjust = 2)
Prediction: 65.963 vs actual: 75, difference = 9.037 (~12.05% of actual).
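The error figures quoted above are computed as:

```r
predicted <- 65.963
actual    <- 75
abs_error <- actual - predicted        # difference between actual and predicted
pct_error <- abs_error / actual * 100  # difference as a share of the actual value
```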
We designed a Shiny app to allow users to input features and get predicted popularity:
https://samuel0731.shinyapps.io/spotify_shiny_website/
knitr::include_graphics("demo1.png")
The top panel shows historical song data; in the bottom-left panel, the user inputs a song name, artist, and feature values:
knitr::include_graphics("demo2.png")
The right panel shows the predicted popularity and a comparison with other songs.
This study analyzed the evolution of music genres over decades using visualizations. Using PCA, Decision Tree, and Random Forest, we found Random Forest yields the lowest MSE, making it the best prediction model. Finally, we implemented a Shiny interactive platform for user-friendly predictions.